A Scalable Approach to Harvest Modern Weblogs
Abstract
Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog’s web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
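The abstract describes rule generation only at a high level: values from the blog's web feed are string-matched against the blog's HTML. The sketch below illustrates that general idea and is not the BlogForever implementation. It takes a value known from the feed (here, a post title), finds the HTML text node most similar to it, and records that node's tag path as an extraction rule that can be reused on other posts of the same blog. The names `TextNodeCollector`, `learn_rule` and `apply_rule`, the `tag.class` path format, and the use of `difflib` for similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextNodeCollector(HTMLParser):
    """Collects (path, text) pairs; path is a crude chain of tag.class steps."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, e.g. "h1.post-title"
        self.nodes = []  # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((" > ".join(self.stack), text))


def collect_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    return collector.nodes


def learn_rule(feed_value, html):
    """Return the path of the text node most similar to the feed value."""
    nodes = collect_nodes(html)
    if not nodes:
        return None
    best_path, _ = max(
        nodes, key=lambda n: SequenceMatcher(None, feed_value, n[1]).ratio()
    )
    return best_path


def apply_rule(rule, html):
    """Return the text of every node whose path equals the learned rule."""
    return [text for path, text in collect_nodes(html) if path == rule]


# Hypothetical pages from the same blog, sharing one template.
PAGE_A = """<html><body><div class="nav">Home</div>
<h1 class="post-title">Harvesting blogs at scale</h1>
<div class="post-body">Blogs are worth preserving.</div></body></html>"""

PAGE_B = """<html><body><div class="nav">Home</div>
<h1 class="post-title">A second post</h1>
<div class="post-body">More text.</div></body></html>"""

# The title comes from the feed entry for PAGE_A; PAGE_B has no feed entry.
rule = learn_rule("Harvesting blogs at scale", PAGE_A)
titles = apply_rule(rule, PAGE_B)
```

In this sketch the rule is learned once from a post that still appears in the feed and then applied to older posts that do not, which is the reason for pairing the feed with the hypertext. A real crawler would need a more robust selector than an exact path match, since templates can vary between pages.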
Journal: International Journal on Artificial Intelligence Tools
Volume: 24
Publication date: 2015